Natural Language Processing
             Rada Mihalcea




Fall 2011
Any Light at The End of The Tunnel?




•   Yahoo, Google, Microsoft -> Information Retrieval
•   Monster.com, HotJobs.com (job finders) -> Information Extraction +
    Information Retrieval
•   Systran powers Babelfish -> Machine Translation
•   Ask Jeeves -> Question Answering
•   Myspace, Facebook, Blogspot -> Processing of User-Generated Content
•   Tools for “business intelligence”
•   All “Big Guys” have (several) strong NLP research labs:
     – IBM, Microsoft, AT&T, Xerox, Sun, etc.
•   Academia: research in a university environment
Why Natural Language Processing?
• Huge amounts of data
   – Internet = at least 20 billion pages
   – Intranet
• Applications for processing large amounts of text require NLP expertise
• Classify text into categories
• Index and search large texts
• Automatic translation
• Speech understanding
   – Understand phone conversations
• Information extraction
   – Extract useful information from resumes
• Automatic summarization
   – Condense 1 book into 1 page
• Question answering
• Knowledge acquisition
• Text generation / dialogues
Natural?
• Natural Language?
   – Refers to the language spoken by people, e.g. English,
     Japanese, Swahili, as opposed to artificial languages, like
     C++, Java, etc.
• Natural Language Processing
   – Applications that deal with natural language in one way or
     another
• [Computational Linguistics
   – Doing linguistics on computers
   – More on the linguistic side than NLP, but closely related ]
Why Natural Language Processing?
•   kJfmmfj mmmvvv nnnffn333
•   Uj iheale eleee mnster vensi credur
•   Baboi oi cestnitze
•   Coovoel2^ ekk; ldsllk lkdf vnnjfj?
•   Fgmflmllk mlfm kfre xnnn!
Computers Lack Knowledge!
• Computers “see” text in English the same way you have
  just seen the previous text!
• People have no trouble understanding language
   – Common sense knowledge
   – Reasoning capacity
   – Experience
• Computers have
   – No common sense knowledge
   – No reasoning capacity
Where does it fit in the CS taxonomy?
Computers
   – Databases
   – Artificial Intelligence
       – Robotics
       – Natural Language Processing
           – Information Retrieval
           – Machine Translation
           – Language Analysis
               – Semantics
               – Parsing
       – Search
   – Algorithms
   – Networking
Linguistic Levels of Analysis
• Speech
• Written language
  –   Phonology: sounds / letters / pronunciation
  –   Morphology: the structure of words
  –   Syntax: how sequences of words are structured
  –   Semantics: meaning of the strings
• Interaction between levels
Issues in Syntax
“the dog ate my homework” - Who did what?
1. Identify the part of speech (POS)
  Dog = noun ; ate = verb ; homework = noun
  English POS tagging accuracy: about 95%


2. Identify collocations
    mother in law, hot dog
    Compositional versus non-compositional
    collocates
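
Step 1 above can be sketched as a plain dictionary lookup. This is only a minimal illustration, assuming a tiny hand-made lexicon (`LEXICON` and its tag names are hypothetical); real taggers use context to resolve ambiguous words, which a lookup table cannot do.

```python
# Hypothetical mini-lexicon mapping each word to one part of speech.
# Assumption: every word has exactly one tag, which is not true in general.
LEXICON = {
    "the": "determiner",
    "my": "determiner",
    "dog": "noun",
    "homework": "noun",
    "ate": "verb",
}

def tag(sentence):
    """Tag each word by dictionary lookup; unknown words get 'unknown'."""
    return [(word, LEXICON.get(word, "unknown")) for word in sentence.lower().split()]

tagged = tag("the dog ate my homework")
for word, pos in tagged:
    print(f"{word} = {pos}")
```

A lookup like this fails as soon as a word can be more than one part of speech (for example "plant" as noun or verb), which is part of why tagging accuracy stays below 100%.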
Issues in Syntax
• Shallow parsing:
  “the dog chased the bear”
  “the dog” “chased the bear”
  subject - predicate
  Identify basic structures
  NP-[the dog] VP-[chased the bear]
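
The subject/predicate split above can be sketched with a very crude rule: everything before the first verb is the NP, the rest is the VP. The tag table `TAGS` and the function `chunk` are hypothetical names for this illustration, and the rule is an assumption that only holds for simple sentences like the one on the slide.

```python
# Hypothetical tag table for the example sentence (DT = determiner,
# NN = noun, VB = verb).
TAGS = {"the": "DT", "dog": "NN", "bear": "NN", "chased": "VB"}

def chunk(sentence):
    """Split a sentence into a subject NP and a predicate VP at the first verb."""
    words = sentence.lower().split()
    tags = [TAGS.get(w, "NN") for w in words]  # assumption: unknown words are nouns
    if "VB" not in tags:
        return [("NP", words)]
    verb_at = tags.index("VB")
    return [("NP", words[:verb_at]), ("VP", words[verb_at:])]

chunks = chunk("the dog chased the bear")
print(" ".join(f"{label}-[{' '.join(ws)}]" for label, ws in chunks))
```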
Issues in Syntax
• Full parsing: John loves Mary




  Helps answer (automatically) questions like: Who did what
  and when?
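
Once a full parse is available, "who did what" can be read off the tree. The sketch below assumes one plausible parse of "John loves Mary", written as nested tuples with Penn Treebank-style labels; the tree and the function names are assumptions made for this illustration, not the output of a real parser.

```python
# Assumed parse: (label, child, child, ...); a preterminal is (tag, word).
TREE = ("S",
        ("NP", ("NNP", "John")),
        ("VP", ("VBZ", "loves"),
               ("NP", ("NNP", "Mary"))))

def leaves(node):
    """Return the words under a node, left to right."""
    label, *children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [children[0]]
    return [word for child in children for word in leaves(child)]

def who_did_what(tree):
    """Read subject, verb, and object off an S -> NP VP, VP -> V NP tree."""
    _, subject, verb_phrase = tree
    _, verb, obj = verb_phrase
    return {
        "who": " ".join(leaves(subject)),
        "did": " ".join(leaves(verb)),
        "whom": " ".join(leaves(obj)),
    }

answer = who_did_what(TREE)
print(answer)
```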
More Issues in Syntax
• Anaphora Resolution:
“The dog entered my room. It scared me”

• Preposition Attachment
“I saw the man in the park with a telescope”
Issues in Semantics
•   Understand language! How?
•   “plant” = industrial plant
•   “plant” = living organism
•   Words are ambiguous
•   Importance of semantics?
    – Machine Translation: wrong translations
    – Information Retrieval: wrong information
    – Anaphora Resolution: wrong referents
Why Semantics?
• The sea is home to millions of plants and
  animals
• English -> French [commercial MT system]
• Le mer est a la maison de billion des usines et
  des animaux
• French -> English
• The sea is at the home for billions factories and
  animals
Issues in Semantics
• How to learn the meaning of words?
• From dictionaries:
  plant, works, industrial plant -- (buildings for carrying on
  industrial labor; "they built a large plant to manufacture
  automobiles")
  plant, flora, plant life -- (a living organism lacking the power of
  locomotion)
They are producing about 1,000 automobiles in the new plant
The sea flora consists in 1,000 different plant species
The plant was close to the farm of animals.
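
One well-known dictionary-based approach, not named on the slide, is a simplified Lesk-style overlap count: pick the sense whose dictionary entry shares the most words with the sentence. The sketch below uses the two entries quoted above; the stopword list, the function names, and the tie handling are assumptions for illustration.

```python
import re

# Small hand-picked stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "in", "to", "for", "on",
             "they", "are", "was", "about"}

# Synonyms plus gloss for each sense, taken from the dictionary entries above.
SENSES = {
    "industrial plant": "plant works industrial plant buildings for carrying on "
                        "industrial labor they built a large plant to "
                        "manufacture automobiles",
    "living organism": "plant flora plant life a living organism lacking "
                       "the power of locomotion",
}

def content_words(text, target):
    """Lowercased words of text, minus stopwords and the target word itself."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS and w != target}

def disambiguate(sentence, target="plant"):
    """Return the sense with the largest word overlap, or None if undecided."""
    context = content_words(sentence, target)
    scores = {sense: len(context & content_words(entry, target))
              for sense, entry in SENSES.items()}
    best = max(scores.values())
    winners = [sense for sense, score in scores.items() if score == best]
    return winners[0] if best > 0 and len(winners) == 1 else None

for s in ["They are producing about 1,000 automobiles in the new plant",
          "The sea flora consists in 1,000 different plant species",
          "The plant was close to the farm of animals."]:
    print(s, "->", disambiguate(s))
```

The first two sentences are decided by "automobiles" and "flora" respectively; the third shares no content word with either entry, so the dictionary alone leaves it undecided.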
Issues in Semantics
• Learn from annotated examples:
  – Assume 100 examples containing “plant”
    previously tagged by a human
  – Train a learning algorithm
  – How to choose the learning algorithm?
  – How to obtain the 100 tagged examples?
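
As one possible answer to the "which algorithm?" question, the sketch below trains a Naive Bayes classifier with add-one smoothing. Naive Bayes is an arbitrary choice made for this illustration, and the four tagged sentences in `TAGGED` are made up; a real experiment would use the 100 human-tagged examples.

```python
import math
from collections import Counter, defaultdict

# Made-up stand-ins for the human-tagged examples: (sense, sentence).
TAGGED = [
    ("industrial", "the plant produces automobiles"),
    ("industrial", "workers at the plant manufacture cars"),
    ("living", "the plant needs water and sunlight"),
    ("living", "a green plant grows in the garden"),
]

def train(examples):
    """Count senses and the words seen with each sense."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    for sense, text in examples:
        sense_counts[sense] += 1
        word_counts[sense].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return sense_counts, word_counts, vocab

def classify(text, model):
    """Return the sense with the highest smoothed log-probability."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    best_sense, best_score = None, -math.inf
    for sense, n in sense_counts.items():
        denom = sum(word_counts[sense].values()) + len(vocab)
        score = math.log(n / total)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[sense][word] + 1) / denom)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense

model = train(TAGGED)
print(classify("the new plant will manufacture automobiles", model))
print(classify("water the plant in the garden", model))
```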
Issues in Information Extraction
• “There was a group of about 8-9 people close to
  the entrance on Highway 75”
• Who? “8-9 people”
• Where? “Highway 75”

• Extract information
• Detect new patterns:
  – Detect hacking / hidden information / etc.
• Gov./mil. puts lots of money into IE
  research
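
The who/where extraction above can be sketched with two hand-written patterns. The regular expressions `WHO` and `WHERE` are assumptions tailored to this one sentence; real information extraction systems rely on much more general patterns or learned models.

```python
import re

SENTENCE = ("There was a group of about 8-9 people close to "
            "the entrance on Highway 75")

# Hypothetical patterns: a number or number range followed by "people",
# and the word "Highway" followed by a number.
WHO = re.compile(r"\b\d+(?:-\d+)?\s+people\b")
WHERE = re.compile(r"\bHighway\s+\d+\b")

def extract(text):
    """Return the first 'who' and 'where' matches, or None when absent."""
    who = WHO.search(text)
    where = WHERE.search(text)
    return {
        "who": who.group(0) if who else None,
        "where": where.group(0) if where else None,
    }

facts = extract(SENTENCE)
print(facts)
```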
Issues in Information Retrieval
• General model:
   – A huge collection of texts
   – A query
• Task: find documents that are relevant to the given
  query
• How? Create an index, like the index in a book
• More …
   – Vector-space models
   – Boolean models
• Examples: Google, Yahoo, Altavista, etc.
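
The "index, like the index in a book" idea can be sketched as an inverted index combined with the simplest Boolean model (all query terms must appear). The three-document collection `DOCS` is made up for illustration, and the function names are hypothetical.

```python
from collections import defaultdict

# Made-up toy collection: document id -> text.
DOCS = {
    1: "the dog chased the bear",
    2: "the dog ate my homework",
    3: "the bear entered my room",
}

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

index = build_index(DOCS)
print(boolean_and(index, "dog bear"))
```

A Boolean model returns an unranked set; vector-space models instead score each document against the query so results can be ordered by relevance.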
Issues in Information Retrieval
•   Retrieve specific information
•   Question Answering
•   “What is the height of Mount Everest?”
•   About 29,000 feet
Issues in Information Retrieval
• Find information across languages!
• Cross Language Information Retrieval
• “What is the minimum age requirement for car
  rental in Italy?”
• Also search Italian texts for “età minima per
  noleggio macchine”
• Integrate a large number of languages
• Integrate into high-performance IR engines
Issues in Machine Translation
• Text-to-Text Machine Translation
• Speech-to-Speech Machine Translation

• Most of the work has addressed pairs of widely
  spoken languages like English-French and
  English-Chinese
Issues in Machine Translation
• How to translate text?
  – Learn from previously translated data
  -> Need parallel corpora
• French-English, Chinese-English have the
  Hansards
• Reasonable translations
• Chinese-Hindi – no such tools available today!
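
To give a feel for what "learning from previously translated data" means, the sketch below guesses word translations from co-occurrence in aligned sentence pairs, scored with the Dice coefficient. This is a deliberate simplification chosen for illustration (statistical MT systems use proper alignment models), and the three-pair corpus `PARALLEL` is made up.

```python
from collections import Counter

# Made-up toy parallel corpus: (English sentence, French sentence).
PARALLEL = [
    ("the house", "la maison"),
    ("the sea", "la mer"),
    ("blue house", "maison bleue"),
]

def train(pairs):
    """Count in how many sentence pairs each word and word pair occurs."""
    src_count, tgt_count, cooc = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        cooc.update((s, t) for s in src_words for t in tgt_words)
    return src_count, tgt_count, cooc

def best_translation(word, model):
    """Pick the target word with the highest Dice score for a source word."""
    src_count, tgt_count, cooc = model
    scores = {t: 2 * c / (src_count[s] + tgt_count[t])
              for (s, t), c in cooc.items() if s == word}
    return max(scores, key=scores.get) if scores else None

model = train(PARALLEL)
for w in ["house", "sea", "the", "blue"]:
    print(w, "->", best_translation(w, model))
```

With no parallel data for a language pair, there are no counts to learn from, which is the point of the Chinese-Hindi remark above.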
Even More
• Discourse
• Summarization
• Subjectivity and sentiment analysis
• Text generation, dialog [pass the Turing test
  for some million dollars] – Loebner prize
• Knowledge acquisition [how to get that
  common sense knowledge]
• Speech processing
What will we study this semester?
• Intro to Perl
   – Great for text processing
   – Fast: one person can do the work of ten others
   – Easy to pick up
• Some linguistic basics
   – Structure of English
   – Parts of speech, phrases, parsing
• Morphology
• N-grams
   – Also multi-word expressions
• Part of speech tagging
• Syntactic parsing
• Semantics
   – Word sense disambiguation
   – Semantic relations
What will we study this semester?
•   Information Retrieval
•   Question answering
•   Text classification
•   Text summarization
•   Sentiment analysis

• Depending on time, we may touch on
    –   Speech recognition
    –   Dialogue
    –   Text generation
    –   Other topics of your interest
Administrivia
•   Instructor: Rada Mihalcea, F228, rada@cs.unt.edu
•   Class meetings: TTh 11-12:20pm
•   Office hours: TTh 4:00-5:00pm
•   TA: TBA
•   Textbook: Speech and Language Processing, by Jurafsky
    and Martin (2nd edition)
    – Recommended: Foundations of Statistical Natural Language Processing,
      by Manning and Schütze
• Other readings (papers) may be assigned throughout the
  semester
• Grading: Assignments, 2 exams, term project
    – Late submission policy for assignments: can submit up to three days late,
      with 10% penalty / day
